ON STOCHASTIC APPROXIMATION



Abstract

This paper deals with a stochastic process for the approximation of the
root of a regression equation. This process was first suggested by Robbins
and Monro [1].

The main result here is a necessary and sufficient condition on the 
iteration coefficients for convergence of the process (convergence with 
probability one and convergence in the quadratic mean). 



1. Introduction and Summary 



In their classical paper Robbins and Monro [1] treated the following
problem.

Let $F(y \mid x)$ be a family of distribution functions depending upon a real
parameter $x$, $-\infty < x < +\infty$, and let $M(x)$,

    $M(x) = \int_{-\infty}^{\infty} y \, dF(y \mid x)$,

be the corresponding regression function. It is assumed that $M(x)$ and
$F(y \mid x)$ are unknown to the experimenter who can, however, take observations
on $F(y \mid x)$ for any value $x$. Robbins and Monro gave a method for solving
stochastically the regression equation

(1)  $M(x) = \alpha$,

where $\alpha$ is a given number. Under certain conditions on $M(x)$ they were
able to construct an iteration procedure $\{X_n\}$ such that $X_n$ converges in
probability to the (unique) root $\theta$ of (1).

This "Robbins-Monro procedure" is defined as follows. Let $\{a_n\}$ be a
fixed sequence of positive numbers such that

(2)  $\sum_{n=1}^{\infty} a_n = \infty, \qquad \sum_{n=1}^{\infty} a_n^2 < \infty$.

The iteration procedure is then defined recursively as the nonstationary
Markov chain given by

(3)  $X_{n+1} = X_n - a_n (Y_n - \alpha), \qquad P(X_1 = a \in R^1) = 1$,

where $Y_n$ is a random variable distributed according to $F(y \mid x = X_n)$ or,
in another notation, $Y_n$ is a realization of the random variable $Y(X_n)$.
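The recursion (3) can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the function name `robbins_monro`, the toy regression function M(x) = 2x + 1 with Gaussian noise, and the step sequence a_n = 1/n (which satisfies (2)) are hypothetical choices, not taken from the paper.

```python
import random

def robbins_monro(observe, alpha, x1, steps, n_iter):
    """Run X_{n+1} = X_n - a_n (Y_n - alpha) and return the last iterate."""
    x = x1
    for n in range(1, n_iter + 1):
        y = observe(x)                    # one noisy observation Y_n taken at x = X_n
        x = x - steps(n) * (y - alpha)    # recursion (3)
    return x

# Assumed toy model: M(x) = 2x + 1, unit-variance noise; the root of M(x) = 5 is theta = 2.
rng = random.Random(0)
observe = lambda x: 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)
estimate = robbins_monro(observe, alpha=5.0, x1=0.0, steps=lambda n: 1.0 / n, n_iter=20000)
print(estimate)
```

The experimenter only ever calls `observe`; the regression function itself is never evaluated by the procedure, which mirrors the setting described above.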

Later several authors (e.g., Blum [2], Dvoretzky [3]) have shown that
even under weaker conditions than those imposed in [1] on $M(x)$, the
Robbins-Monro process also converges with probability one and in the quadratic mean.

In this paper we deal with the question of whether it is possible to
relax the parameter condition (2). The main result is that the condition

(4)  $a_n \to 0 \ (n \to \infty), \qquad \sum_{n=1}^{\infty} a_n = \infty$,

in connection with certain assumptions on $M(x)$, is necessary and sufficient
for convergence with probability one and in the quadratic mean. Furthermore,
the proof of convergence seems to be more elementary than proofs given by
earlier writers.
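To make the difference between (2) and (4) concrete, the following Monte Carlo sketch uses the step sequence a_n = n^(-1/2), which satisfies (4) but violates (2) because the sum of a_n^2 diverges. The toy model (M(x) = x, so that theta = alpha = 0, with unit-variance Gaussian noise), the starting point and the function name `mean_square_error` are assumptions made for this illustration only.

```python
import random

def mean_square_error(n_iter, n_runs=200, seed=1):
    """Monte Carlo estimate of E(X_{n+1} - theta)^2 after n_iter steps with a_n = n**-0.5."""
    rng = random.Random(seed)
    theta, alpha = 0.0, 0.0               # assumed toy model: M(x) = x, root theta = 0
    total = 0.0
    for _ in range(n_runs):
        x = 3.0                           # arbitrary starting point X_1
        for n in range(1, n_iter + 1):
            y = x + rng.gauss(0.0, 1.0)   # Y_n has mean M(X_n) = X_n and variance 1
            x -= n ** -0.5 * (y - alpha)
        total += (x - theta) ** 2
    return total / n_runs

early = mean_square_error(50)
late = mean_square_error(5000)
print(early, late)
```

In this toy setting the estimated mean square error keeps shrinking as the number of steps grows, which is the behaviour the main result asserts for sequences satisfying (4).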



2. Lemmas 

In this section we state and prove two lemmas which will be needed for
the proof of Theorem 1 and Theorem 2 given in Section 3.

Lemma A. Let $\{a_i\}$ be a sequence of real numbers. Then

    $\sum_{i=1}^{n} a_i \prod_{j=i+1}^{n} (1 - a_j) = 1 - \prod_{i=1}^{n} (1 - a_i), \qquad n \ge 1$.¹
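The identity of Lemma A is easy to check numerically. The sketch below evaluates both sides for an arbitrary finite list of real numbers; the helper name `lemma_a_sides` and the sample values are hypothetical. The empty product for the last term of the sum is taken to be one, in line with the convention stated in the footnote.

```python
import math

def lemma_a_sides(a):
    """Return (left side, right side) of Lemma A for the finite list a = [a_1, ..., a_n]."""
    n = len(a)
    lhs = sum(
        a[i] * math.prod(1.0 - a[j] for j in range(i + 1, n))  # empty product equals 1
        for i in range(n)
    )
    rhs = 1.0 - math.prod(1.0 - x for x in a)
    return lhs, rhs

lhs, rhs = lemma_a_sides([0.5, -1.25, 2.0, 0.1, 0.3])
print(lhs, rhs)
```

Both sides agree up to floating-point rounding, also for negative entries and entries larger than one, since the lemma holds for arbitrary real sequences.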



Lemma B. Let $\{a_i\}$ be a sequence of positive numbers satisfying the condition

    $a_n \to 0 \ (n \to \infty), \qquad \sum_{n=1}^{\infty} a_n = \infty$.

Then

    $\sum_{i=1}^{n} a_i^2 \prod_{j=i+1}^{n} (1 - a_j) \to 0 \qquad (n \to \infty)$.






¹Throughout this paper the factor of the last term of such a sum equals one.
Lemma A can easily be verified by induction. To prove Lemma B we note first
that for any $\varepsilon > 0$ there exists an integer $N_0 = N_0(\varepsilon)$ such that, for all
$n \ge N_0$,

    $a_n < \varepsilon, \qquad 0 < 1 - a_n < 1$.

Hence we get the inequality

    $\sum_{i=1}^{n} a_i^2 \prod_{j=i+1}^{n} (1 - a_j) \le \prod_{i=N_0}^{n} (1 - a_i) \sum_{i=1}^{N_0-1} a_i^2 \prod_{j=i+1}^{N_0-1} (1 - a_j) + \varepsilon \sum_{i=N_0}^{n} a_i \prod_{j=i+1}^{n} (1 - a_j)$.

The factor of $\varepsilon$ is less than one by virtue of Lemma A. Because of the
divergence of $\sum a_n$ there exists for any $M > 0$ an integer $N_1 = N_1(\varepsilon, M)$ such
that, for all $n \ge N_1$,

    $\prod_{i=N_0}^{n} (1 - a_i) < \varepsilon \cdot M^{-1}$.

If we denote

    $\sum_{i=1}^{N_0-1} a_i^2 \prod_{j=i+1}^{N_0-1} (1 - a_j) = M$

it follows immediately that

    $\sum_{i=1}^{n} a_i^2 \prod_{j=i+1}^{n} (1 - a_j) < 2\varepsilon$ for all $n \ge N_1$.

This completes the proof of Lemma B.
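The statement of Lemma B can be illustrated numerically. Writing S_n for the sum in the lemma, splitting off the last term gives S_n = (1 - a_n) S_{n-1} + a_n^2 with S_0 = 0, which the sketch below iterates. The function name `lemma_b_sum` and the choice a_n = n^(-1/2) are assumptions for the example; this sequence tends to zero and has a divergent sum, while the sum of its squares diverges as well.

```python
def lemma_b_sum(a, n):
    """Compute S_n = sum_{i=1}^n a(i)^2 * prod_{j=i+1}^n (1 - a(j))
    through the recursion S_k = (1 - a(k)) * S_{k-1} + a(k)^2, S_0 = 0."""
    s = 0.0
    for k in range(1, n + 1):
        s = (1.0 - a(k)) * s + a(k) ** 2
    return s

a = lambda n: n ** -0.5      # assumed example: a_n -> 0, sum a_n = infinity
for n in (10, 1000, 100000):
    print(n, lemma_b_sum(a, n))
```

The printed values decrease towards zero even though the squared steps are not summable, which is exactly what allows condition (2) to be replaced by (4) later on.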






3. Stochastic Approximation of the Root of a Regression Equation

Let us assume that the regression function $M(x)$ corresponding to the
family of distribution functions $F(y \mid x)$ satisfies the following conditions:

(5)  $c_1 |x - \theta| \le |M(x) - \alpha| \le c_2 |x - \theta| + c_3, \qquad c_2 \ge c_1 > 0, \quad c_3 \ge 0$;

(6)  $M(x) < \alpha$ for $x < \theta$,
     $M(x) = \alpha$ for $x = \theta$,
     $M(x) > \alpha$ for $x > \theta$.

The variance of $Y(x)$ is supposed to be uniformly bounded in $x$,

(7)  $\operatorname{Var} Y(x) \le c_4 < \infty$.

Then we state the following theorems.

Theorem 1. If conditions (4) through (7) hold, then the stochastic process
$\{X_n\}$ given by (3) converges to $\theta$ with probability one and in
the quadratic mean,

    $X_n \to \theta$ w.pr. 1, $\qquad E(X_n - \theta)^2 \to 0 \quad (n \to \infty)$.

If we replace condition (5) by

(5')  $c_1 |x - \theta| \le |M(x) - \alpha| \le c_2 |x - \theta|, \qquad c_2 \ge c_1 > 0$,

and if we add the assumption that in a neighborhood of $\theta$ Var $Y(x)$ does not
vanish,

(8)  $\operatorname{Var} Y(x) \ge c_5 > 0$ for all $x \in \{x : |x - \theta| < \delta\}$, $\delta > 0$,

then the parameter condition (4) is even necessary and sufficient for the
convergence of $\{X_n\}$ to $\theta$.

Theorem 2. If conditions (5'), (6), (7), (8) hold, then $\{X_n\}$ converges to
$\theta$ with probability one and in the quadratic mean if and only if
the parameter sequence $\{a_n\}$ fulfills condition (4).
Proof of Theorem 1. We derive a recursion formula for the sequence $E_n = E(X_n - \theta)^2$.
From (3) we have

    $E_{n+1} = E(X_{n+1} - \theta)^2 = E[X_n - \theta - a_n (Y_n - \alpha)]^2$
    $\qquad = E_n - 2 a_n E[(X_n - \theta) E(Y(x) - \alpha \mid x = X_n)] + a_n^2 E(E[(Y(x) - \alpha)^2 \mid x = X_n])$.

Because of (5) and (6) it follows that

    $E[(X_n - \theta) E(Y(x) - \alpha \mid x = X_n)] = E[(X_n - \theta)(M(X_n) - \alpha)]$
    $\qquad = E[|X_n - \theta| \, |M(X_n) - \alpha|] \ge c_1 E(X_n - \theta)^2 \ge 0$.

From (5) and (7) we get

    $E(E[(Y(x) - \alpha)^2 \mid x = X_n]) = E\{E[(Y(x) - M(x) + M(x) - \alpha)^2 \mid x = X_n]\}$
    $\qquad = E\{\operatorname{Var} Y(X_n) + (M(X_n) - \alpha)^2\} \le E(c_4 + c_2^2 (X_n - \theta)^2 + 2 c_2 c_3 |X_n - \theta| + c_3^2)$
    $\qquad \le c_4 + 2 c_2 c_3 + c_3^2 + (c_2^2 + 2 c_2 c_3) E(X_n - \theta)^2$,

where the last step uses $|t| \le 1 + t^2$. Using these inequalities and setting

    $c_4 + 2 c_2 c_3 + c_3^2 = c_6, \qquad c_2^2 + 2 c_2 c_3 = c_7$,

it follows at once that

(9)  $E_{n+1} \le (1 - 2 c_1 a_n + c_7 a_n^2) E_n + c_6 a_n^2$.

Because of the convergence of $\{a_n\}$ to zero and $c_7 \ge c_1^2$ there exists for
each constant $c_8$, $0 < c_8 < 2 c_1$, an integer $N_2$ such that for all $n \ge N_2$

    $0 \le 1 - 2 c_1 a_n + c_7 a_n^2 \le 1 - c_8 a_n$.
This yields the more convenient inequality

    $E_{n+1} \le (1 - c_8 a_n) E_n + c_6 a_n^2, \qquad n \ge N_2$.

Adding up this inequality from $N_2$ to $n$ we find

(10)  $E_{n+1} \le E_{N_2} \prod_{i=N_2}^{n} (1 - c_8 a_i) + c_6 \sum_{i=N_2}^{n} a_i^2 \prod_{j=i+1}^{n} (1 - c_8 a_j)$.

The first term of the right-hand side of (10) converges to zero since $E_{N_2}$ is
finite and

    $\prod_{i=N_2}^{n} (1 - c_8 a_i) \to 0 \qquad (n \to \infty)$

because of the divergence of $\sum a_n$. Because of Lemma B the same holds for
the second term,

    $\sum_{i=N_2}^{n} a_i^2 \prod_{j=i+1}^{n} (1 - c_8 a_j) = (c_8)^{-2} \sum_{i=N_2}^{n} (c_8 a_i)^2 \prod_{j=i+1}^{n} (1 - c_8 a_j) \to 0 \qquad (n \to \infty)$.

This concludes the proof of convergence in the quadratic mean.
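The step from the one-step inequality to (10) amounts to unrolling a linear recursion. The sketch below treats the worst case, in which the inequality holds with equality, and compares the iterated recursion with the product-and-sum form on the right-hand side of (10). All numerical values (c_8 = 1, c_6 = 2, N_2 = 4, a starting value of 10 and a_n = n^(-1/2)) and both function names are assumptions chosen for illustration.

```python
import math

def iterate_recursion(e_start, a, c8, c6, n2, n):
    """Iterate E_{k+1} = (1 - c8*a(k)) * E_k + c6*a(k)^2 for k = n2, ..., n."""
    e = e_start
    for k in range(n2, n + 1):
        e = (1.0 - c8 * a(k)) * e + c6 * a(k) ** 2
    return e

def closed_form(e_start, a, c8, c6, n2, n):
    """Right-hand side of (10): a product term plus a weighted sum of squared steps."""
    first = e_start * math.prod(1.0 - c8 * a(i) for i in range(n2, n + 1))
    second = c6 * sum(
        a(i) ** 2 * math.prod(1.0 - c8 * a(j) for j in range(i + 1, n + 1))
        for i in range(n2, n + 1)
    )
    return first + second

a = lambda n: n ** -0.5      # assumed step sequence satisfying (4); 1 - c8*a(n) > 0 for n >= 4
print(iterate_recursion(10.0, a, 1.0, 2.0, 4, 300), closed_form(10.0, a, 1.0, 2.0, 4, 300))
print(iterate_recursion(10.0, a, 1.0, 2.0, 4, 20000))
```

The two forms agree up to rounding, and the bound itself decays as n grows, in line with the two limit statements used in the proof.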

To show that $\{X_n\}$ converges also with probability one we use a method
which is similar to that employed by Dvoretzky [3]. We derive the convergence
with probability one from the convergence in the mean.

For any pair $\varepsilon > 0$, $\delta > 0$, there exists an integer $N_3 = N_3(\varepsilon, \delta)$ such
that, for all $n \ge N_3$,

    $E_n = E(X_n - \theta)^2 < \varepsilon \delta^2$.

We modify the sequence $\{X_n\}$:

(11)  $X'_n = X_n$ for all $n \le N_3$,

(12)  $X'_{n+1} = \begin{cases} X'_n - a_n (Y_n - \alpha) & \text{if } |X'_n - \theta| < \delta, \\ X'_n & \text{otherwise,} \end{cases} \qquad n \ge N_3$.

$Y_n$ denotes now a realization of the random variable $Y(x = X'_n)$ instead of
$Y(x = X_n)$.

Equations (11) and (12) imply that also

(13)  $E(X'_n - \theta)^2 < \varepsilon \delta^2$ for all $n \ge N_3$.

If $|X_j - \theta| \ge \delta$ for any $j \ge N_3$, it follows from (12) that $|X'_n - \theta| \ge \delta$
for all $n \ge j \ge N_3$ and we obtain, for all $n \ge N_3$,

    $P\{\max_{N_3 \le j \le n} |X_j - \theta| \ge \delta\} \le P\{|X'_n - \theta| \ge \delta\}$.

Together with (13) and Chebyshev's inequality this implies that $\{X_n\}$ converges
with probability one to $\theta$, i.e.,

    $P\{\sup_{j \ge N_3} |X_j - \theta| \ge \delta\} \le \varepsilon$.

This completes the proof of Theorem 1.
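The role of the modified sequence in (11) and (12) can be seen in a small simulation. The sketch couples the original iterate and its frozen copy by feeding both the same observation while they coincide, which is an assumption of this illustration, as are the toy model M(x) = x with theta = alpha = 0, the step sequence a_n = n^(-1/2), the numerical values, and the function name `run_pair`.

```python
import random

def run_pair(rng, n_start, n_end, delta, x_start, theta=0.0, alpha=0.0):
    """Simulate X_j and the frozen copy X'_j from index n_start up to n_end.
    Returns (max_j |X_j - theta|, |X'_{n_end} - theta|)."""
    x = x_frozen = x_start
    frozen = abs(x_frozen - theta) >= delta
    max_dev = abs(x - theta)
    for n in range(n_start, n_end):
        step = n ** -0.5
        y = x + rng.gauss(0.0, 1.0)                    # observation taken at the original iterate
        if not frozen:
            x_frozen = x_frozen - step * (y - alpha)   # X' equals X until the first exit
            frozen = abs(x_frozen - theta) >= delta    # after an exit, X' never moves again
        x = x - step * (y - alpha)
        max_dev = max(max_dev, abs(x - theta))
    return max_dev, abs(x_frozen - theta)

rng = random.Random(3)
delta = 0.5
results = [run_pair(rng, n_start=100, n_end=2000, delta=delta, x_start=0.1) for _ in range(500)]
exits = sum(1 for m, _ in results if m >= delta)
frozen_outside = sum(1 for _, f in results if f >= delta)
print(exits, frozen_outside)
```

Every path on which the original sequence leaves the delta-neighbourhood is a path on which the frozen copy ends outside it, so a bound on the final frozen value controls the maximum over the whole stretch.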



Proof of Theorem 2. Since Theorem 1 implies the sufficiency of parameter
condition (4), it remains only to prove that the parameter condition
(4) is necessary for the convergence of $\{X_n\}$ to $\theta$. We assume that the
sequence $\{E_n\}$ converges to zero even in the case when we use a parameter
sequence $\{a_n\}$ which does not satisfy condition (4). We show that this
assumption yields a contradiction.



The parameter sequence $\{a_n\}$ under consideration has to fulfill exactly
one of the following conditions:

(a)  $\sum_{i=1}^{\infty} a_i < \infty$;

(b)  there exists a subsequence $\{a_{n_i}\}$ and a constant $L > 0$ such that
     $a_{n_i} \ge L > 0$ for all $i$.

From the asserted convergence of $E_n$ to zero it follows, as we have seen,
that $X_n$ converges to $\theta$ with probability one. Therefore, because of (8),
there exists an integer $N_4$ such that, with probability one,

    $\min_{n \ge N_4} \operatorname{Var} Y(X_n) \ge c_5 > 0$.

In the parameter case (a), which implies $a_n \to 0$ $(n \to \infty)$, we can further
assume that $N_4$ is so large that

    $0 < 1 - 2 c_2 a_n + c_1^2 a_n^2 < 1$ for all $n \ge N_4$.

Hence it follows from (9) by similar arguments to those used before that

    $E_{n+1} \ge E_n - 2 c_2 a_n E_n + a_n^2 [E(\operatorname{Var} Y(X_n)) + c_1^2 E_n]$
    $\qquad \ge (1 - 2 c_2 a_n + c_1^2 a_n^2) E_n + c_5 a_n^2, \qquad n \ge N_4$.

Again there exists for each $c_9 > 2 c_2$ an integer $N_5 = N_5(c_9) \ge N_4$ such that

    $E_{n+1} \ge (1 - c_9 a_n) E_n + c_5 a_n^2$ for all $n \ge N_5$.

Hence we get for the parameter case (a)
    $E_{n+1} \ge E_{N_5} \prod_{i=N_5}^{n} (1 - c_9 a_i) + c_5 \sum_{i=N_5}^{n} a_i^2 \prod_{j=i+1}^{n} (1 - c_9 a_j) \ge c_{10}$,

where

    $c_{10} = c_5 a_{N_5}^2 \prod_{j=N_5+1}^{\infty} (1 - c_9 a_j)$

is greater than zero because of the convergence of $\sum a_n$. Hence we have
$E_n \ge c_{10} > 0$ for $n > N_5$, which implies the desired contradiction.

In case (b) we get the contradiction immediately by considering the
sequence of inequalities

    $E_{n_i+1} \ge c_5 a_{n_i}^2 \ge c_5 L^2 > 0$ for all $n_i \ge N_4$.

This completes the proof of Theorem 2.
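The two ways in which condition (4) can fail are easy to see in a toy model where the mean square error obeys an exact recursion. Assume Y(x) = x + noise with noise variance sigma2 and theta = alpha = 0; then X_{n+1} = (1 - a_n) X_n - a_n * noise and hence E_{n+1} = (1 - a_n)^2 E_n + a_n^2 sigma2. The model, the starting value and the function name `exact_mse` are assumptions made for this sketch.

```python
def exact_mse(a, n_iter, e1=9.0, sigma2=1.0):
    """Exact E_{n_iter+1} in the assumed toy model Y(x) = x + noise, theta = alpha = 0,
    via E_{k+1} = (1 - a(k))^2 * E_k + a(k)^2 * sigma2."""
    e = e1
    for k in range(1, n_iter + 1):
        e = (1.0 - a(k)) ** 2 * e + a(k) ** 2 * sigma2
    return e

summable = exact_mse(lambda n: n ** -2.0, 100000)     # case (a): sum of a_n is finite
constant = exact_mse(lambda n: 0.5, 100000)           # case (b): a_n does not tend to zero
admissible = exact_mse(lambda n: n ** -0.5, 100000)   # condition (4) holds
print(summable, constant, admissible)
```

In this example the mean square error stays bounded away from zero in cases (a) and (b), while it becomes small for the sequence satisfying (4).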

4. Concluding Remarks

The crucial assumptions which lead to the weakening of the parameter
condition (2) are the two assumptions contained in (5) and (5'), respectively.
One of the assumptions in (5),

(14)  $|M(x) - \alpha| \le c_2 |x - \theta| + c_3$,

cannot be relaxed, as was pointed out, e.g., by A. Dvoretzky ([3], p. 51).
However, it might be interesting to know if the validity of Theorems 1 and 2
is affected by weakening the other assumption made in (5) and (5'),

(15)  $c_1 |x - \theta| \le |M(x) - \alpha|$.
In particular, we may ask if it is possible to replace (15) by the usual
condition (e.g., Blum [2], p. 581)

    $\inf_{\delta_1 \le |x - \theta| \le \delta_2} |M(x) - \alpha| > 0$ for every pair of numbers $(\delta_1, \delta_2)$ with $0 < \delta_1 < \delta_2 < \infty$.

In practice, however, conditions (14) and (15) will cause no trouble, because
in almost all instances the experimenter knows that the root $\theta$ lies in some
finite interval $[C_*, C^*]$. Therefore he can replace the iteration procedure
(3) by the bounded stochastic approximation process

    $X_{n+1} = \begin{cases} C_* & \text{if } X_n - a_n (Y_n - \alpha) < C_*, \\ X_n - a_n (Y_n - \alpha) & \text{if } C_* \le X_n - a_n (Y_n - \alpha) \le C^*, \\ C^* & \text{if } X_n - a_n (Y_n - \alpha) > C^*. \end{cases}$

In this situation (14) and (15) do not seem very restrictive.
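The bounded process is an ordinary step followed by clipping to the interval. The sketch below applies it to an assumed regression function M(x) = x^3 with alpha = 1 (root theta = 1), which violates the linear growth bound (14) on the whole line but satisfies both (14) and (15) on the interval [-2, 2]. The function name `bounded_step`, the interval, the noise model and the step sequence a_n = n^(-1/2) are hypothetical choices.

```python
import random

def bounded_step(x, y, a_n, alpha, c_low, c_high):
    """One step of the bounded process: move by -a_n * (y - alpha), then clip to [c_low, c_high]."""
    candidate = x - a_n * (y - alpha)
    return min(max(candidate, c_low), c_high)

rng = random.Random(2)
x = -1.0                                   # arbitrary starting point inside the interval
path = []
for n in range(1, 20001):
    y = x ** 3 + rng.gauss(0.0, 1.0)       # assumed model: M(x) = x^3, unit-variance noise
    x = bounded_step(x, y, n ** -0.5, alpha=1.0, c_low=-2.0, c_high=2.0)
    path.append(x)
print(x)
```

Clipping keeps every iterate inside the known interval, so only the behaviour of M(x) on that interval matters for the conditions.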